attribute of R, and Q an SQL query that contains R.A in the SELECT clause. SIT(R.A|Q)
is the statistic for attribute A on the result of executing query expression Q. Q is
called the generating query expression of SIT(R.A|Q). This definition can be extended
for multi-attribute statistics. Furthermore, the definition can be used as the basis for
extending the CREATE STATISTICS statement in SQL where, instead of specifying the
table name of the query, a more general query expression such as a table-valued expression
can be used.
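The difference between a base-table statistic and a SIT can be illustrated with a minimal Python sketch. It assumes rows are plain dictionaries and uses an exact frequency count in place of a histogram; the function name `sit` and the toy tables `R` and `S` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def sit(rows, attribute):
    """Frequency statistic for `attribute` over the rows of a query result.

    Passing the rows of a base table gives an ordinary base-table statistic;
    passing the rows produced by a query expression Q gives SIT(attribute | Q).
    """
    return Counter(row[attribute] for row in rows)

# Toy data (hypothetical): R.s is a foreign key referencing S.id.
R = [{"a": 1, "s": 1}, {"a": 1, "s": 1}, {"a": 2, "s": 2}, {"a": 3, "s": 9}]
S = [{"id": 1}, {"id": 2}]

# Generating query expression Q: R JOIN S ON R.s = S.id
q_result = [{**r, "S.id": s["id"]} for r in R for s in S if r["s"] == s["id"]]

base_stat = sit(R, "a")         # statistic on R.a over the base table
sit_stat = sit(q_result, "a")   # SIT(R.a | Q): statistic on R.a over the join result
```

In this toy data the row with a = 3 has no join partner, so it contributes to the base-table statistic but not to the SIT.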



In U.S. Patent Application Serial No. 10/191,822, incorporated herein by
reference in its entirety, the concept of SITs was introduced. A particular method of
adapting a prior art query optimizer to access and utilize a preexisting set of SITs for cost
estimation was described in detail in this application, which method is summarized here
briefly as background information.

Referring to Figure 2, the query optimizer examines an input query and generates
a query execution plan that most efficiently returns the results sought by the query in
terms of cost. The cost estimation module and its embedded cardinality estimation
module can be modified to utilize statistics on query expressions, or intermediate tables,
to improve the accuracy of cardinality estimates.

In general, the use of SITs is enabled by implementing a wrapper (shown in
phantom in Figure 2) on top of the original cardinality estimation module of the RDBMS.
During the optimization of a single query, the wrapper will be called many times, once
for each different query sub-plan enumerated by the optimizer. Each time the query
optimizer invokes the modified cardinality estimation module with a query plan, this
input plan is transformed by the wrapper into another one that exploits SITs. The
cardinality estimation module uses the input plan to arrive at a potentially more accurate
cardinality estimation that is returned to the query optimizer. The transformed query plan
is thus a temporary structure used by the modified cardinality estimation module and is
not used for query execution.
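The wrapper arrangement described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `wrap_estimator`, `estimate`, and `rewrite_with_sits` are hypothetical names, and a plan is treated as an opaque value that the two callables understand.

```python
from typing import Callable, TypeVar

Plan = TypeVar("Plan")

def wrap_estimator(
    estimate: Callable[[Plan], float],
    rewrite_with_sits: Callable[[Plan], Plan],
) -> Callable[[Plan], float]:
    """Return a cardinality estimator that first rewrites the plan to exploit SITs.

    `estimate` stands for the original, unmodified cardinality estimation module.
    `rewrite_with_sits` stands for the plan transformation (assumed to return a
    new plan rather than mutate its argument).
    """
    def wrapped(plan: Plan) -> float:
        # The transformed plan is a temporary structure used only for
        # estimation; the optimizer keeps the original plan for execution.
        transformed = rewrite_with_sits(plan)
        return estimate(transformed)
    return wrapped

# Toy stand-ins (hypothetical): a plan is a tuple of operator labels.
toy_estimate = lambda plan: float(len(plan))
toy_rewrite = lambda plan: plan + ("use SIT",)

estimator = wrap_estimator(toy_estimate, toy_rewrite)
original_plan = ("scan R",)
result = estimator(original_plan)
```

Because the optimizer calls the wrapper once per enumerated sub-plan, the rewrite step has to be cheap; the sketch leaves that concern to whatever `rewrite_with_sits` does.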

According to the embodiment described in application serial number 10/191,822,
the transformed plan that is passed to the cardinality estimation module exploits
applicable SITs to enable a potentially more accurate cardinality estimate. The original
cardinality estimation module requires little or no modification to accept the transformed
plan as input. The transformation of plans is performed efficiently, which is important
because the transformation will be used for several sub-plans for a single query
optimization.

In general, there will be no SIT that matches a given plan exactly. Instead,
several SITs might be used for some (perhaps overlapping) portions of the input plan.
The embodiment described in application serial number 10/191,822 integrates SITs with
cardinality estimation routines by transforming the input plan into an equivalent one that
exploits SITs as much as possible. The transformation step is based on a greedy
procedure that selects which SITs to apply at each iteration, so that the number of
independence assumptions during the estimation for the transformed query plan is
minimized. Identifying whether or not a SIT is applicable to a given plan leverages
materialized view matching techniques, as can be seen in the following example.
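The general shape of such a greedy selection can be sketched in Python. This is not the procedure of the referenced application, whose details are not reproduced here. It is a generic set-cover-style loop under simplifying assumptions: a plan is a set of predicate labels, a SIT is applicable when every predicate of its generating query expression appears in the plan (a crude stand-in for materialized view matching), and the number of plan predicates left uncovered serves as a rough proxy for the remaining independence assumptions. All names are hypothetical.

```python
from typing import Dict, FrozenSet, List

def choose_sits(
    plan_predicates: FrozenSet[str],
    sits: Dict[str, FrozenSet[str]],
) -> List[str]:
    """Greedily pick applicable SITs, each time covering the most uncovered predicates."""
    # Applicability test (assumption): the SIT's generating predicates
    # must all occur in the plan.
    applicable = {name: preds for name, preds in sits.items()
                  if preds <= plan_predicates}
    uncovered = set(plan_predicates)
    chosen: List[str] = []
    while uncovered and applicable:
        # Ties are broken by dictionary insertion order.
        best = max(applicable, key=lambda name: len(applicable[name] & uncovered))
        if not applicable[best] & uncovered:
            break  # no remaining SIT improves the plan
        chosen.append(best)
        uncovered -= applicable[best]
    return chosen

plan = frozenset({"R JOIN S", "R JOIN T", "S.a<10", "T.b>20"})
candidates = {
    "SIT1": frozenset({"R JOIN S", "S.a<10"}),
    "SIT2": frozenset({"R JOIN T", "T.b>20"}),
    "SIT3": frozenset({"R JOIN U"}),  # not applicable: the plan has no join with U
}
picked = choose_sits(plan, candidates)
```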


In the query shown in Figure 3(a), $R \bowtie S$ and $R \bowtie T$ are (skewed) foreign-key
joins. Only a few tuples in S and T verify predicates $\sigma_{S.a<10}(S)$ and $\sigma_{T.b>20}(T)$, and most
tuples in R join precisely with these tuples in S and T. In the absence of SITs,



Therefore error values must be estimated using efficient and coarse mechanisms. 
Existing information such as system catalogs or characteristics of the input query can be 
used but not additional information created specifically for such purpose. 

Application serial number 10/191,822 introduced an error function, nInd, that is
simple and intuitive, and uses the fact that the independence assumption is the main
source of errors during selectivity estimation. The overall error of a decomposition
$S = Sel_R(P_1 \mid Q_1) \cdot \ldots \cdot Sel_R(P_n \mid Q_n)$, when approximated, respectively, using
$H_1(P_1 \mid Q'_1), \ldots, H_n(P_n \mid Q'_n)$ (with $Q'_i \subseteq Q_i$), is defined as the total number of predicate
independence assumptions during the approximation, normalized by the maximum number of
independence assumptions in the decomposition (to get a value between 0 and 1). In
symbols, this error function is as follows:

$$nInd\bigl(\{Sel_R(P_i \mid Q_i),\ H_i(P_i \mid Q'_i)\}\bigr) = \frac{\sum_{i=1}^{n} |P_i| \cdot |Q_i - Q'_i|}{\sum_{i=1}^{n} |P_i| \cdot |Q_i|}$$
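The nInd error can be computed directly from the predicate sets of a decomposition, as in the following minimal Python sketch. It assumes each factor is described by three sets of predicate labels: the predicates P_i being estimated, the conditioning predicates Q_i, and the subset Q'_i of Q_i actually captured by the statistic used. The names `Factor` and `n_ind` are hypothetical, and returning 0.0 when the denominator is zero (no conditioning anywhere) is an assumption of this sketch, not something stated in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

@dataclass(frozen=True)
class Factor:
    p: FrozenSet[str]        # P_i: predicates whose selectivity is estimated
    q: FrozenSet[str]        # Q_i: predicates the factor is conditioned on
    q_used: FrozenSet[str]   # Q'_i: part of Q_i captured by the statistic

    def __post_init__(self):
        if not self.q_used <= self.q:
            raise ValueError("Q'_i must be a subset of Q_i")

def n_ind(factors: Iterable[Factor]) -> float:
    """Independence assumptions made, normalized by the maximum possible."""
    factors = list(factors)
    made = sum(len(f.p) * len(f.q - f.q_used) for f in factors)
    maximum = sum(len(f.p) * len(f.q) for f in factors)
    return made / maximum if maximum else 0.0  # assumption: 0.0 when nothing to assume

join = frozenset({"R JOIN S"})
filt = frozenset({"S.a<10"})
none = frozenset()

# Sel(S.a<10 | R JOIN S) * Sel(R JOIN S), using only a base-table histogram on S.a
err_base = n_ind([Factor(filt, join, none), Factor(join, none, none)])
# Same decomposition, but using a SIT on S.a built over R JOIN S (so Q'_1 = Q_1)
err_sit = n_ind([Factor(filt, join, join), Factor(join, none, none)])
```

With the base-table histogram the single possible independence assumption is made; with the SIT none is, so the error drops from 1 to 0.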

Each term in the numerator represents the fact that $P_i$ and $Q_i - Q'_i$ are independent
with respect to $Q'_i$, and therefore the number of predicate independence assumptions is
$|P_i| \cdot |Q_i - Q'_i|$. In turn, each term in the denominator represents the maximum number
of independence assumptions when $Q'_i = \emptyset$, i.e., $|P_i| \cdot |Q_i|$. As a very simple example,
consider $S = Sel_R(R.a<10, R.b>50)$ and decomposition $S = Sel_R(R.a<10 \mid R.b>50) \cdot Sel_R(R.b>50)$.
If base table histograms H(R.a) and H(R.b) are used, the error using nInd
is $\frac{1 \cdot 1 + 1 \cdot 0}{1 \cdot 1 + 1 \cdot 0} = 1$, i.e., one out of one independence assumptions (between
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